Abstract

Recent work on language modelling has shifted focus from count-based models to neural models. In these works, the words in each sentence are always considered in a left-to-right order. In this paper we show how we can improve the performance of the recurrent neural network (RNN) language model by incorporating the syntactic dependencies of a sentence, which have the effect of bringing relevant contexts closer to the word being predicted. We evaluate our approach on the Microsoft Research Sentence Completion Challenge and show that the proposed dependency RNN improves over the RNN by about 10 points in accuracy. Furthermore, we achieve results comparable with the state-of-the-art models on this task.

1 Introduction 

Language Models (LM) are commonly used to score a sequence of tokens according to its probability of occurring in natural language. They are an essential building block in a variety of applications such as machine translation, speech recognition and grammatical error correction. The standard way of evaluating a language model has been to calculate its perplexity on a large corpus. However, this evaluation assumes the output of the language model to be probabilistic, and it has been observed that perplexity does not always correlate with the downstream task performance.


For these reasons, Zweig and Burges (2012) proposed the Sentence Completion Challenge, in which the task is to pick the correct word to complete a sentence out of five candidates. Performance is evaluated by accuracy (how many sentences were completed correctly), thus both probabilistic and non-probabilistic models (e.g. Roark et al. (2007)) can be compared. Recent approaches for this task include both neural and count-based language models (Zweig et al., 2012; Gubbins and Vlachos, 2013; Mnih and Kavukcuoglu, 2013; Mikolov et al., 2013).

Most neural language models consider the tokens in a sentence in the order they appear, and the hidden state representation of the network is typically reset at the beginning of each sentence. In this work we propose a novel neural language model that learns a recurrent neural network (RNN) (Mikolov et al., 2010) on top of the syntactic dependency parse of a sentence. Syntactic dependencies bring relevant contexts closer to the word being predicted, thus enhancing performance as shown by Gubbins and Vlachos (2013) for count-based language models. Our Dependency RNN model is published simultaneously with another model, introduced in Tai et al. (2015), who extend the Long Short-Term Memory (LSTM) architecture to tree-structured network topologies and evaluate it on sentence-level sentiment classification and semantic relatedness tasks, but not as a language model.

Adapting the RNN to use the syntactic dependency structure required resetting and running the network on all the paths in the dependency parse tree of a given sentence, while maintaining a count of how often each token appears in those paths. Furthermore, we explain how we can incorporate the dependency labels as features.

Our results show that the proposed dependency RNN language model outperforms the RNN proposed by Mikolov et al. (2011) by about 10 points in accuracy. Furthermore, it improves upon the count-based dependency language model of Gubbins and Vlachos (2013), while performing slightly worse than the recent state-of-the-art results by Mnih and Kavukcuoglu (2013). Finally, we make the code and preprocessed data available to facilitate comparisons with future work.






























2 Dependency Recurrent Neural Network

Count-based language models operate by assigning probabilities to sentences by factorizing their likelihood into n-grams. Neural language models further embed each word $w(t)$ into a low-dimensional vector representation (denoted by $\mathbf{s}(t)$).¹ These word representations are learned as the language model is trained (Bengio et al., 2003) and make it possible to define a word in relation to other words in a metric space.

Recurrent Neural Network

Mikolov et al. (2010) suggested the use of Recurrent Neural Networks (RNN) to model long-range dependencies between words, as they are not restricted to a fixed context length, unlike the feedforward neural network (Bengio et al., 2003). The hidden representation $\mathbf{s}(t)$ for the word in position $t$ of the sentence in the RNN follows a first-order auto-regressive dynamic (Eq. 1), where $\mathbf{W}$ is the matrix connecting the hidden representation of the previous word $\mathbf{s}(t-1)$ to the current one, $\mathbf{w}(t)$ is the one-hot index of the current word (in a vocabulary of size $N$ words) and $\mathbf{U}$ is the matrix containing the embeddings for all the words in the vocabulary:

$$\mathbf{s}(t) = f\left(\mathbf{W}\mathbf{s}(t-1) + \mathbf{U}\mathbf{w}(t)\right) \qquad (1)$$

The nonlinearity $f$ is typically the logistic sigmoid function $f(x) = \frac{1}{1 + \exp(-x)}$. At each time step, the RNN generates the word probability vector $\mathbf{y}(t)$ for the next word $w(t+1)$, using the output word embedding matrix $\mathbf{V}$ and the softmax nonlinearity $g(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$:

$$\mathbf{y}(t) = g\left(\mathbf{V}\mathbf{s}(t)\right) \qquad (2)$$
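As an illustration of Eqs. (1) and (2), the following is a minimal pure-Python sketch of a single RNN time step. The toy matrix sizes and values, and the helper names `matvec` and `rnn_step`, are assumptions made for this example only; they do not come from the toolkit used in the paper.

```python
import math

def sigmoid(x):
    """Logistic sigmoid f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    """Softmax g, computed with max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def rnn_step(W, U, V, s_prev, word_index):
    """One step of Eqs. (1)-(2): returns the pair (s(t), y(t))."""
    # With a one-hot w(t), the product U w(t) selects one column of U.
    Uw = [row[word_index] for row in U]
    Ws = matvec(W, s_prev)
    s = [sigmoid(a + b) for a, b in zip(Ws, Uw)]
    y = softmax(matvec(V, s))
    return s, y

# Toy sizes (hypothetical): hidden size H = 2, vocabulary size N = 3.
W = [[0.1, -0.2], [0.0, 0.3]]                # H x H
U = [[0.5, -0.5, 0.1], [0.2, 0.4, -0.3]]     # H x N
V = [[0.3, -0.1], [0.0, 0.2], [-0.4, 0.1]]   # N x H

s0 = [0.0, 0.0]                              # hidden state after a reset
s1, y1 = rnn_step(W, U, V, s0, word_index=1)
```

Here `y1` is a probability distribution over the three toy words, and `s1` would be passed as `s_prev` at the next time step.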


¹ In our notation, we make a distinction between the word token $w(t)$ at position $t$ in the sentence and its one-hot vector representation $\mathbf{w}(t)$. We note $w_i$ the $i$-th word token on a breadth-first traversal of a dependency parse tree.

RNN with Maximum Entropy Model

Mikolov et al. (2011) combined RNNs with a maximum entropy model, essentially adding a matrix that directly connects the input words' n-gram context $\mathbf{w}(t-n+1, \ldots, t)$ to the output word probabilities. In practice, because of the large vocabulary size $N$, designing such a matrix is computationally prohibitive. Instead, a hash-based implementation is used, where the word context is fed through a hash function $h$ that computes the index $h(\mathbf{w}(t-n+1, \ldots, t))$ of the context words in a one-dimensional array $\mathbf{d}$ of size $D$ (typically, $D = 10^9$). Array $\mathbf{d}$ is trained in the same way as the rest of the RNN model and contributes to the output word probabilities:

$$\mathbf{y}(t) = g\left(\mathbf{V}\mathbf{s}(t) + \mathbf{d}_{h(\mathbf{w}(t-n+1, \ldots, t))}\right) \qquad (3)$$

As we show in our experiments, this additional matrix is crucial to good performance on word completion tasks.
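The hashed direct connections of Eq. (3) can be sketched as follows. This is only an illustration: the hash function `ngram_hash`, the way the candidate output word is folded into the hash, and the tiny array size are assumptions for this example and are not taken from the paper or its toolkit.

```python
import math

D = 1000  # toy array size (assumption); the paper uses a far larger array

def ngram_hash(context, candidate, size=D):
    """Hypothetical hash h: maps (context word ids, candidate word id) to an index in d."""
    h = 0
    for token in list(context) + [candidate]:
        h = (h * 1000003 + token + 1) % size
    return h

def softmax(scores):
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def output_distribution(Vs, d, context, size=D):
    """Eq. (3): add one hashed weight from d to each candidate's score V s(t)."""
    scores = [Vs[k] + d[ngram_hash(context, k, size)] for k in range(len(Vs))]
    return softmax(scores)

d = [0.0] * D            # direct-connection weights, all zero before training
Vs = [0.0, 0.0, 0.0]     # stand-in for the product V s(t), vocabulary of 3 words
context = (2, 0)         # ids of the two preceding words (3-gram features)

# Pretend training has raised the weight linking this context to word 1.
d[ngram_hash(context, 1)] = 2.0
y = output_distribution(Vs, d, context)
```

With the hidden-layer scores held equal, the word whose hashed weight was raised receives most of the probability mass, which is how such connections let the model memorize frequent n-grams.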

Training RNNs

RNNs are trained using maximum likelihood through gradient-based optimization, such as Stochastic Gradient Descent (SGD) with an annealed learning rate $\lambda$. The Back-Propagation Through Time (BPTT) variant of SGD makes it possible to sum up gradients from consecutive time steps before updating the parameters of the RNN and to handle the long-range temporal dependencies in the hidden $\mathbf{s}$ and output $\mathbf{y}$ sequences. The loss function is the cross-entropy between the generated word distribution $\mathbf{y}(t)$ and the target one-hot word distribution $\mathbf{w}(t+1)$, and involves the log-likelihood terms $\log y_{w(t+1)}(t)$. For speed-up, the estimation of the output word probabilities is done using hierarchical softmax outputs, i.e., class-based factorization (Mikolov and Zweig, 2012). Each word $w^i$ is assigned to a class $c^i$ and the corresponding log-likelihood is effectively $\log y_{w^i}(t) = \log y_{c^i}(t) + \log y_{j}(t)$, where $j$ is the index of word $w^i$ among words belonging to class $c^i$. In our experiments, we binned the words found in our training corpus into 250 classes according to frequency, roughly corresponding to the square root of the vocabulary size.
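The class-based factorization above can be sketched in a few lines. The class assignment, the scores and the function name `class_factored_log_prob` are hypothetical; the sketch only shows that the word log-probability is the sum of a class term and a within-class term, and that the resulting probabilities still sum to one.

```python
import math

def log_softmax(scores):
    """Log of the softmax, computed stably."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(v - m) for v in scores))
    return [v - log_z for v in scores]

def class_factored_log_prob(word, word_class, class_members, class_scores, word_scores):
    """log P(word) = log P(class of word) + log P(word | class)."""
    c = word_class[word]
    members = class_members[c]
    j = members.index(word)                  # index of the word inside its class
    log_p_class = log_softmax(class_scores)[c]
    log_p_word = log_softmax([word_scores[m] for m in members])[j]
    return log_p_class + log_p_word

# Toy vocabulary of 4 words split into 2 classes (hypothetical values).
class_members = [[0, 1], [2, 3]]
word_class = {0: 0, 1: 0, 2: 1, 3: 1}
class_scores = [0.2, -0.1]                   # one output score per class
word_scores = [0.5, 0.1, -0.3, 0.0]          # one output score per word

probs = [
    math.exp(class_factored_log_prob(w, word_class, class_members,
                                     class_scores, word_scores))
    for w in range(4)
]
```

Only the class softmax and the softmax over one class's members are needed per word, instead of a softmax over the whole vocabulary, which is where the speed-up comes from.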

Dependency RNN

RNNs are designed to process sequential data: at each time step they are presented with a word $w(t)$ and generate the probability distribution $\mathbf{y}(t)$ of the next word. They can be reset at the beginning of a sentence by setting all the values of the hidden vector $\mathbf{s}(t)$ to zero.

Dependency parsing (Nivre, 2005) generates, for each sentence (which we note $\{w(t)\}_{t=0}^{T}$), a parse tree with a single root, many leaves and a unique path (also called unroll) from the root to each leaf, as illustrated in Figure 1. We now note $\{w_i\}_i$ the set of word tokens appearing in the parse tree of a sentence. The order in the notation derives from the breadth-first traversal of that tree (i.e., the root word is noted $w_0$). Each of the unrolls can be seen as a different sequence of words $\{w_i\}$, starting from the single root $w_0$, that are visited when one takes a specific path on the parse tree.

Figure 1: Example dependency tree for the sentence "ROOT I saw the ship with very strong binoculars" (arc labels include pobj).

We propose a simple transformation to the RNN algorithm so that it can process dependency parse trees. The RNN is reset and independently run on each such unroll. As detailed in the next paragraph, when evaluating the log-probability of the sentence, a word token $w_i$ can appear in multiple unrolls but its log-likelihood is counted only once. During training, and to avoid over-training the network on word tokens that appear in more than one unroll (words near the root appear in more unrolls than those nearer the leaves), each word token $w_i$ is given a weight discount $\alpha_i = \frac{1}{n_i}$, based on the number $n_i$ of unrolls the token appears in. Since the RNN is optimized using SGD and updated at every time step, the contribution of word token $w_i$ can be discounted by multiplying the learning rate by the discount factor: $\alpha_i \lambda$.

Sentence Probability in Dependency RNN

Given a word $w_i$, let us define the ancestor sequence $A(w_i)$ to be the subsequence of words, taken as a subset from $\{w_k\}_{k=0}^{i-1}$, describing the path from the root node $w_0$ to the parent of $w_i$. For example, in Figure 1 the ancestors $A(\text{very})$ of word token very are saw, binoculars and strong. Assuming that each word $w_i$ is conditionally independent of the words outside of its ancestor sequence, given its ancestor sequence $A(w_i)$, Gubbins and Vlachos (2013) showed that the probability of a sentence (i.e., the probability of a lexicalized tree $S_T$ given an unlexicalized tree $T$) could be written as:

$$P[S_T \mid T] = \prod_{i=1}^{|S|} P[w_i \mid A(w_i)] \qquad (4)$$

This means that the conditional likelihood of a word given its ancestors needs to be counted only once in the calculation of the sentence likelihood, even though each word can appear in multiple unrolls. When modeling a sentence using an RNN, the state $\mathbf{s}_j$ that is used to generate the distribution of word $w_i$ (where $j$ is the parent of $i$ in the tree) represents the vector embedding of the history of the ancestor words $A(w_i)$. Therefore, we count the term $P[w_i \mid \mathbf{s}_j]$ only once when computing the likelihood of the sentence.
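The notions of unroll and weight discount can be made concrete with a short sketch. The toy tree below is hypothetical: it keeps the saw, binoculars, strong, very path mentioned in the text, but the other arcs are assumptions and some words of the example sentence are left out. Tokens are identified by their word string for readability; a real implementation would use token positions, since a word can occur several times in a sentence.

```python
from collections import Counter

def unrolls(heads):
    """Return every root-to-leaf path of a tree given a child -> head map."""
    children = {}
    root = None
    for node, head in heads.items():
        if head is None:
            root = node
        else:
            children.setdefault(head, []).append(node)
    paths = []
    stack = [[root]]
    while stack:
        path = stack.pop()
        kids = children.get(path[-1], [])
        if not kids:                 # reached a leaf: the path is a full unroll
            paths.append(path)
        for kid in kids:
            stack.append(path + [kid])
    return paths

def discounts(paths):
    """Discount alpha_i = 1 / n_i, with n_i the number of unrolls containing token i."""
    counts = Counter(tok for path in paths for tok in path)
    return {tok: 1.0 / n for tok, n in counts.items()}

# Hypothetical toy tree (child -> head); None marks the root.
heads = {
    "saw": None, "I": "saw", "ship": "saw", "the": "ship",
    "binoculars": "saw", "strong": "binoculars", "very": "strong",
}
paths = unrolls(heads)
alpha = discounts(paths)

learning_rate = 0.1                  # lambda, an arbitrary illustrative value
effective_rate = {tok: a * learning_rate for tok, a in alpha.items()}
```

In this toy tree the root word appears in all three unrolls, so its updates are scaled by one third, while leaf words keep the full learning rate.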


3 Labelled Dependency RNN 


The model presented so far does not use dependency labels. For this purpose we adapted the context-dependent RNN (Mikolov and Zweig, 2012) to handle them as additional $M$-dimensional label input features $\mathbf{f}(t)$. These features require a matrix $\mathbf{F}$ that connects label features to word vectors, thus yielding a new dynamical model (Eq. 5) in the RNN, and a matrix $\mathbf{G}$ that connects label features to output word probabilities. The full model becomes as follows:

$$\mathbf{s}(t) = f\left(\mathbf{W}\mathbf{s}(t-1) + \mathbf{U}\mathbf{w}(t) + \mathbf{F}\mathbf{f}(t)\right) \qquad (5)$$

$$\mathbf{y}(t) = g\left(\mathbf{V}\mathbf{s}(t) + \mathbf{G}\mathbf{f}(t) + \mathbf{d}_{h(\mathbf{w}(t-n+1, \ldots, t))}\right) \qquad (6)$$

On our training dataset, the dependency parsing model found $M = 44$ distinct labels (e.g., nsubj, det or prep). At each time step $t$, the context word $\mathbf{w}(t)$ is associated with a single dependency label $\mathbf{f}(t)$ (a one-hot vector of dimension $M$).

Let $G(w)$ be the sequence of grammatical relations (dependency tree labels) between successive elements of $(A(w), w)$. The factorization of the sentence likelihood from Eq. 4 becomes:

$$P[S_T \mid T] = \prod_{i=1}^{|S|} P[w_i \mid A(w_i), G(w_i)] \qquad (7)$$
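A minimal sketch of one labelled time step, following Eqs. (5) and (6), is given below. The function name `labelled_rnn_step`, the optional `direct` argument standing in for the hashed term, and the toy sizes are assumptions for illustration only.

```python
import math

def labelled_rnn_step(W, U, F, V, G, s_prev, word_index, label_index, direct=None):
    """One step of Eqs. (5)-(6) with a one-hot word and a one-hot dependency label."""
    H = len(s_prev)
    N = len(V)
    s = []
    for i in range(H):
        # One-hot inputs select a column of U (word) and a column of F (label).
        a = sum(W[i][k] * s_prev[k] for k in range(H))
        a += U[i][word_index] + F[i][label_index]
        s.append(1.0 / (1.0 + math.exp(-a)))
    scores = []
    for k in range(N):
        z = sum(V[k][i] * s[i] for i in range(H)) + G[k][label_index]
        if direct is not None:       # optional hashed n-gram contribution
            z += direct[k]
        scores.append(z)
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return s, [e / total for e in exps]

# Toy sizes (hypothetical): H = 2 hidden units, N = 3 words, M = 2 labels.
H, N, M = 2, 3, 2
W = [[0.0] * H for _ in range(H)]
U = [[0.0] * N for _ in range(H)]
F = [[0.0] * M for _ in range(H)]
V = [[0.0] * H for _ in range(N)]
G = [[0.0] * M for _ in range(N)]
G[2][1] = 3.0                        # label 1 favours word 2 at the output

s_a, y_a = labelled_rnn_step(W, U, F, V, G, [0.0, 0.0], word_index=0, label_index=0)
s_b, y_b = labelled_rnn_step(W, U, F, V, G, [0.0, 0.0], word_index=0, label_index=1)
```

With all other weights at zero, switching the label from 0 to 1 moves the output from a uniform distribution to one that favours word 2, which isolates the effect of the label-to-output matrix.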

4 Implementation and Dataset 

We modified the Feature-Augmented RNN toolkit² and adapted it to handle tree-structured data. Specifically, instead of being run sequentially on the entire training corpus, the RNN is run on all the word tokens in all unrolls of all the sentences in all the books of the corpus. The RNN is reset at the beginning of each unroll of a sentence. When calculating the log-probability of a sentence, the contribution of each word token is counted only once (and stored in a hash-table specific to that sentence). Once all the unrolls of a sentence are processed, the log-probability of the sentence is the sum of the per-token log-probabilities in that hash-table. We further enhanced the RNN library by replacing some large matrix multiplication routines by calls to the CBLAS library, thus yielding a two- to three-fold speed-up in the test and training time.³

² http://research.microsoft.com/en-us/projects/rnn/
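The per-sentence hash table described above can be sketched with a Python dictionary. The function `token_log_prob` is a hypothetical stand-in for the RNN (here a dummy returning a constant), tokens are identified by their position in the sentence, and the sketch assumes that each unroll starts with a ROOT pseudo-token that is not itself scored.

```python
def sentence_log_prob(unrolls, token_log_prob):
    """Sum per-token log-probabilities over all unrolls, counting each token once."""
    per_token = {}                   # hash table specific to this sentence
    for path in unrolls:
        # path[0] is assumed to be the ROOT pseudo-token and is not scored.
        for depth in range(1, len(path)):
            token = path[depth]
            if token not in per_token:
                # The ancestors of the token are the preceding elements of the path.
                per_token[token] = token_log_prob(path[:depth], token)
    return sum(per_token.values())

calls = []

def dummy_log_prob(ancestors, token):
    """Hypothetical scorer: records the call and returns a constant log-probability."""
    calls.append(token)
    return -1.0

# Three unrolls of one toy sentence; token 2 plays the role of the main verb.
toy_unrolls = [[0, 2, 1], [0, 2, 4, 3], [0, 2, 8, 7, 6]]
total = sentence_log_prob(toy_unrolls, dummy_log_prob)
```

Token 2 occurs in all three unrolls but contributes a single term, so the total here is the sum over the seven distinct non-root tokens.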

The training corpus consists of 522 19th century novels from Project Gutenberg (Zweig and Burges, 2012). All processing (sentence-splitting, PoS tagging, syntactic parsing) was performed using the Stanford CoreNLP toolkit (Manning et al., 2014). The test set contains 1040 sentences to be completed. Each sentence consists of one ground truth and 4 impostor sentences where a specific word has been replaced with a syntactically correct but semantically incorrect impostor word. Dependency trees are generated for each sentence candidate. We split that set into two, using the first 520 sentences in the validation (development) set and the latter 520 sentences in the test set. During training, we start annealing the learning rate $\lambda$ with decay factor 0.66 as soon as the classification error on the validation set starts to increase.

³ Our code and our preprocessed datasets are available from: https://github.com/piotrmirowski/DependencyTreeRnn

5 Results

Architecture      50h    100h   200h   300h
RNN (dev)         29.6   30.0   30.0   30.6
RNN (test)        28.1   30.0   30.4   28.5
RNN+2g (dev)      29.6   28.7   29.4   29.8
RNN+2g (test)     29.6   28.7   28.1   30.2
RNN+3g (dev)      39.2   39.4   38.8   36.5
RNN+3g (test)     40.8   40.6   40.2   39.8
RNN+4g (dev)      40.2   40.6   40.0   40.2
RNN+4g (test)     42.3   41.2   40.4   39.2

Table 1: Accuracy of sequential RNN on the MSR Sentence Completion Challenge.

Architecture         50h    100h   200h
depRNN+3g (dev)      53.3   54.2   54.2
depRNN+3g (test)     51.9   52.7   51.9
ldepRNN+3g (dev)     48.8   51.5   49.0
ldepRNN+3g (test)    44.8   45.4   47.7
depRNN+4g (dev)      52.7   54.0   52.7
depRNN+4g (test)     48.9   51.3   50.8
ldepRNN+4g (dev)     49.4   50.0   (48.5)
ldepRNN+4g (test)    47.7   51.4   (47.7)

Table 2: Accuracy of (un-)labeled dependency RNN (depRNN and ldepRNN respectively).

Table 1 shows the accuracy (validation and test sets) obtained using a simple RNN with 50, 100, 200 and 300-dimensional hidden word representation and 250 frequency-based word classes (vocabulary size N = 72846 words appearing at least 5 times in the training corpus). One notices that adding the direct connections from the word context to the target word (using the additional matrix described in Section 2) with 3-gram contexts raises performance from a poor accuracy of about 30% to about 40% test accuracy, essentially matching the 39% accuracy reported for Good-Turing n-gram language models in Zweig et al. (2012). Modelling 4-grams yields even better results, closer to the 45% accuracy reported for RNNs in Zweig et al. (2012).⁴

As Table 2 shows, dependency RNNs (depRNN) yield an improvement of about 10 points in word accuracy over sequential RNNs.

The best accuracy achieved by the depRNN on the combined development and test sets used to report results in previous work was 53.5%. The best reported results in the MSR sentence completion challenge have been achieved by Log-BiLinear Models (LBLs) (Mnih and Hinton, 2007), a variant of neural language models, with 54.7% to 55.5% accuracy (Mnih and Teh, 2012; Mnih and Kavukcuoglu, 2013). We conjecture that their superior performance might stem from the fact that LBLs, just like n-grams, take into account the order of the words in the context and can thus model higher-order Markovian dynamics than the simple first-order autoregressive dynamics in RNNs. The proposed depRNN ignores the left-to-right word order, thus it is likely that a combination of these approaches will result in even higher accuracies.

Gubbins and Vlachos (2013) developed a count-based dependency language model achieving 50% accuracy. Finally, Mikolov et al. (2013) report that they achieved 55.4% accuracy with an ensemble of RNNs, without giving any other details.

⁴ The paper did not provide details on the maximum entropy features or on class-based hierarchical softmax.
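The evaluation protocol behind the accuracies reported in this section can be sketched as follows: each question has five candidate sentences, the language model scores each one, and the highest-scoring candidate is selected. The log-probability values below are invented for illustration, and the function names are hypothetical.

```python
def pick_candidate(candidate_log_probs):
    """Index of the candidate sentence with the highest log-probability."""
    return max(range(len(candidate_log_probs)),
               key=candidate_log_probs.__getitem__)

def accuracy(questions):
    """Percentage of questions whose top-scoring candidate is the ground truth.

    `questions` is a list of (candidate_log_probs, correct_index) pairs.
    """
    correct = sum(1 for scores, gold in questions
                  if pick_candidate(scores) == gold)
    return 100.0 * correct / len(questions)

# Two toy questions with five candidates each (invented scores).
toy_questions = [
    ([-10.0, -12.0, -11.0, -15.0, -13.0], 0),   # model picks 0, which is correct
    ([-9.0, -8.0, -10.0, -11.0, -12.0], 2),     # model picks 1, gold is 2
]
toy_accuracy = accuracy(toy_questions)
```

With five candidates per question, picking uniformly at random would give an expected accuracy of 20%, which puts the figures in Tables 1 and 2 into perspective.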


6 Discussion 


Related work

Mirowski et al. (2010) incorporated syntactic information into neural language models using PoS tags as additional input to LBLs but obtained only a small reduction of the word error rate in a speech recognition task. Similarly, Bian et al. (2014) enriched the Continuous Bag-of-Words (CBOW) model of Mikolov et al. (2013) by incorporating morphology, PoS tags and entity categories into 600-dimensional word embeddings trained on the Gutenberg dataset, increasing sentence completion accuracy from 41% to 44%. Other work on incorporating syntax into language modeling includes Chelba et al. (1997) and Pauls and Klein (2012); however, none of these approaches considered neural language models, only count-based ones. Levy and Goldberg (2014) and Zhao et al. (2014) proposed to train neural word embeddings using skip-grams and CBOWs on dependency parse trees, but did not extend their approach to actual language models such as LBL and RNN and did not evaluate the word embeddings on word completion tasks.

Note that we assume that the dependency tree is supplied prior to running the RNN, which limits the scope of the Dependency RNN to the scoring of complete sentences, not to next word prediction (unless a dependency tree parse for the sentence to be generated is provided). Nevertheless, it is common in speech recognition and machine translation to use a conventional decoder to produce an N-best list of the most likely candidate sentences and then re-score them with the language model (Chelba et al., 1997; Pauls and Klein, 2011).


Tai et al. (2015) propose a similar approach to ours, learning Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997; Graves, 2012) RNNs on dependency parse tree network topologies. Their architecture is not designed to predict next-word probability distributions, as in a language model, but to classify the input words (sentiment analysis task) or to measure the similarity in hidden representations (semantic relatedness task). Their relative improvement in performance (tree LSTMs vs standard LSTMs) on these two tasks is smaller than ours, probably because LSTMs are better than RNNs at storing long-term dependencies and thus do not benefit from the word ordering from dependency trees as much as RNNs. In a similar vein to ours, Miceli-Barone and Attardi (2015) simply propose to enhance RNN-based machine translation by permuting the order of the words in the source sentence to match the order of the words in the target sentence, using source-side dependency parsing.


Limitations of RNNs for word completion

Zweig et al. (2012) reported that RNNs achieve lower perplexity than n-grams but do not always outperform them on word completion tasks. As illustrated in Fig. 2, the validation set perplexity (comprising all 5 choices for each sentence) of the RNN keeps decreasing monotonically (once we start annealing the learning rate), whereas the validation accuracy rapidly reaches a plateau and oscillates. Our observation confirms that, once an RNN has gone through a few training epochs, change in perplexity is no longer a good predictor of change in word accuracy. We presume that the log-likelihood of the word distribution is not a training objective crafted for precision@1, and that further perplexity reduction happens in the middle and tail of the word distribution.

Figure 2: Perplexity vs. accuracy of RNNs
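A small numerical sketch may help to see why perplexity can keep improving without any change in accuracy. It uses the standard definition of perplexity as the exponential of the negative mean token log-likelihood; the two "checkpoints" and their per-token log-probabilities are invented for illustration and do not correspond to any measurement in the paper.

```python
import math

def perplexity(token_log_probs):
    """exp of the negative mean log-likelihood (natural logarithms)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def summarize(candidates):
    """Return (perplexity over all candidates' tokens, index of best candidate)."""
    all_tokens = [lp for cand in candidates for lp in cand]
    choice = max(range(len(candidates)), key=lambda k: sum(candidates[k]))
    return perplexity(all_tokens), choice

# Per-token log-probabilities of two candidate sentences at two hypothetical
# training checkpoints; every token improves, but the ranking does not change.
early = [[-2.0, -3.0, -2.5], [-2.0, -3.5, -2.5]]
late = [[-1.5, -2.5, -2.0], [-1.5, -3.0, -2.0]]

ppl_early, choice_early = summarize(early)
ppl_late, choice_late = summarize(late)
```

Perplexity drops between the two checkpoints while the selected candidate stays the same, because accuracy depends only on which candidate ranks first and not on how much probability mass each token receives.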

7 Conclusions 

In this paper we proposed a novel language model, dependency RNN, which incorporates syntactic dependencies into the RNN formulation. We evaluated its performance on the MSR sentence completion task and showed that it improves over RNN by 10 points in accuracy, while achieving results comparable with the state-of-the-art. Further work will include extending the dependency tree language modeling to Long Short-Term Memory RNNs to handle longer syntactic dependencies.